Abstract

We present generalization bounds for the TS-MKL framework for two-stage multiple kernel
learning. We also present bounds for sparse kernel learning formulations within the TS-MKL
framework.

1 Introduction 

Recently, Kumar et al. [6] proposed a framework for two-stage multiple kernel learning that combines
the idea of target kernel alignment and the notion of a good kernel proposed in [1] to learn a
good Mercer kernel. More specifically, given a finite set of base kernels $K_1, \ldots, K_p$ over some
common domain $\mathcal{X}$, we wish to find some combination of these base kernels that is well suited
to the learning task at hand. The paper considers learning a positive linear combination of the
kernels $K_\mu = \sum_{i=1}^p \mu_i K_i$ for some $\mu \in \mathbb{R}^p$, $\mu \ge 0$. It is assumed that the kernels are uniformly
bounded, i.e. for all $x_1, x_2 \in \mathcal{X}$ and $i = 1, \ldots, p$, we have $K_i(x_1, x_2) \le \kappa_i^2$ for some $\kappa_i > 0$. Let
$\kappa = (\kappa_1^2, \ldots, \kappa_p^2) \in \mathbb{R}^p$. Note that $\kappa \ge 0$. Also note that for any $\mu$ and any $x_1, x_2 \in \mathcal{X}$, we have
$K_\mu(x_1, x_2) \le \langle \mu, \kappa \rangle$.
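The combination step can be illustrated with a minimal Python sketch. The Gaussian base kernels, the toy data and the names `combine_kernels`, `base_grams` and `kappa` are assumptions made for illustration only; `kappa` simply holds an upper bound on the values of each base kernel.

```python
import numpy as np

def combine_kernels(base_grams, mu):
    """Return K_mu = sum_i mu_i * K_i for Gram matrices stacked as (p, n, n)."""
    mu = np.asarray(mu, dtype=float)
    if np.any(mu < 0):
        raise ValueError("mu must be non-negative")
    return np.tensordot(mu, base_grams, axes=1)

# Toy data: three hypothetical base kernels evaluated on five points.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
base_grams = np.stack([
    np.exp(-sq_dists),              # Gaussian kernel, values at most 1
    np.exp(-0.1 * sq_dists),        # wider Gaussian kernel, values at most 1
    2.0 * np.exp(-5.0 * sq_dists),  # scaled Gaussian kernel, values at most 2
])
kappa = np.array([1.0, 1.0, 2.0])   # per-kernel upper bounds on kernel values
mu = np.array([0.5, 0.2, 0.3])

K_mu = combine_kernels(base_grams, mu)
bound = float(mu @ kappa)           # <mu, kappa> bounds every entry of K_mu
```

Every entry of `K_mu` is at most `bound`, which is the inequality noted at the end of the paragraph above.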

The notion of suitability used in [6] is that of kernel-goodness first proposed in [1] for classification
tasks. For the sake of simplicity, we shall henceforth consider only binary classification tasks, the
extension to multi-class classification tasks being straightforward. We present below the notion
of goodness used in [6]. For any binary classification task over a domain $\mathcal{X}$ characterized by a
distribution $\mathcal{D}$ over $\mathcal{X} \times \{\pm 1\}$, a Mercer kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with associated Reproducing Kernel
Hilbert Space $\mathcal{H}_K$ and feature map $\Phi_K : \mathcal{X} \to \mathcal{H}_K$ is said to be $(\epsilon, \gamma)$-kernel good if there exists a
unit norm vector $w \in \mathcal{H}_K$, i.e. $\|w\|_{\mathcal{H}_K} = 1$, such that the following holds

2 Learning a Good Kernel



The key idea behind [6] is to try and learn a positive linear combination of kernels that is good
according to the notion presented above. We define the risk functional $\mathcal{R}(\cdot) : \mathbb{R}^p \to \mathbb{R}_+$ as follows:

$$\mathcal{R}(\mu) := \mathbb{E}_{(x,y),(x',y') \sim \mathcal{D}}\left[\left[1 - yy' K_\mu(x, x')\right]_+\right]$$

A combination $\mu$ will be said to be $\epsilon$-combination good if $\mathcal{R}(\mu) \le \epsilon$. The quantity $\mathcal{R}(\mu)$ is of
interest since an application of Jensen's inequality (see [6, Lemma 3.2]) shows us that for any $\mu \ge 0$
that is $\epsilon$-combination good, the kernel $K_\mu$ is $\left(\epsilon, \frac{1}{\sqrt{\langle \mu, \kappa \rangle}}\right)$-kernel good. Furthermore, one can show,
using standard results on capacity of linear function classes (see for example [2, Theorem 21]),
that an $(\epsilon, \gamma)$-good kernel can be used to learn, with confidence $1 - \delta$, a classifier with expected
misclassification rate at most $\epsilon + \epsilon_1$ by using at most $O\left(\frac{1}{\gamma^2 \epsilon_1^2} \log \frac{1}{\delta}\right)$ labeled samples.
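The Jensen step can be sketched as follows. This is our reading of the argument behind [6, Lemma 3.2], not a quotation of it; it assumes the hinge-loss form of kernel goodness and uses the predictor $w$ discussed in Section 5.

```latex
\begin{align*}
w &:= \mathbb{E}_{(x',y')\sim\mathcal{D}}\left[y'\,\Phi_{K_\mu}(x')\right],
\qquad
\|w\|^2 = \mathbb{E}_{(x,y),(x',y')}\left[yy'K_\mu(x,x')\right] \le \langle\mu,\kappa\rangle,\\
\mathbb{E}_{(x,y)}\left[\left[1 - y\langle w,\Phi_{K_\mu}(x)\rangle\right]_+\right]
&= \mathbb{E}_{(x,y)}\left[\left[\mathbb{E}_{(x',y')}\left[1 - yy'K_\mu(x,x')\right]\right]_+\right]\\
&\le \mathbb{E}_{(x,y),(x',y')}\left[\left[1 - yy'K_\mu(x,x')\right]_+\right]
= \mathcal{R}(\mu).
\end{align*}
```

The inequality is Jensen's inequality applied to the convex function $t \mapsto [t]_+$. Rescaling $w$ to unit norm turns the bound on $\|w\|$ into a margin of at least $1/\sqrt{\langle\mu,\kappa\rangle}$.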

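In order to cast this learning problem more cleanly, [6] proposes the construction of a K-space
using the following feature map

$$z : (x, x') \mapsto \left(K_1(x, x'), \ldots, K_p(x, x')\right) \in \mathbb{R}^p$$

This allows us to write, for any $\mu \in \mathbb{R}^p$, $K_\mu(x, x') = \langle \mu, z(x, x') \rangle$. Given $n$ labeled training points
$(x_1, y_1), \ldots, (x_n, y_n)$, define the empirical risk functional $\hat{\mathcal{R}}(\cdot) : \mathbb{R}^p \to \mathbb{R}_+$ as follows^1:

$$\hat{\mathcal{R}}(\mu) := \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \left[1 - y_i y_j \langle \mu, z(x_i, x_j) \rangle\right]_+$$

[6] poses the learning problem as the following optimization problem:

$$\min_{\mu \ge 0} \; \frac{\lambda}{2} \|\mu\|_2^2 + \hat{\mathcal{R}}(\mu)$$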
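A minimal Python sketch of the K-space construction and of one possible way to minimize the regularized empirical risk is given below. The solver (projected subgradient descent with a 1/(lam*t) step size), the toy data and all function names are assumptions made for illustration; the text does not prescribe a particular optimizer.

```python
import numpy as np

def pairwise_features(base_grams, y):
    """Rows are y_i * y_j * z(x_i, x_j) for all pairs i < j; shape (n(n-1)/2, p)."""
    n = base_grams.shape[1]
    i, j = np.triu_indices(n, k=1)
    z = base_grams[:, i, j].T                 # K-space vectors, one row per pair
    return (y[i] * y[j])[:, None] * z

def empirical_risk(mu, signed_z):
    """Average hinge loss [1 - y_i y_j <mu, z(x_i, x_j)>]_+ over pairs i < j."""
    return np.maximum(0.0, 1.0 - signed_z @ mu).mean()

def fit_l2(signed_z, lam, steps=500):
    """Projected subgradient descent on (lam/2)||mu||_2^2 + empirical risk, mu >= 0."""
    mu = np.zeros(signed_z.shape[1])
    best_mu, best_obj = mu.copy(), 1.0        # the objective at mu = 0 equals 1
    for t in range(1, steps + 1):
        active = (signed_z @ mu) < 1.0        # pairs with non-zero hinge loss
        grad = lam * mu - signed_z[active].sum(axis=0) / len(signed_z)
        mu = np.maximum(0.0, mu - grad / (lam * t))   # step, then project onto mu >= 0
        obj = 0.5 * lam * mu @ mu + empirical_risk(mu, signed_z)
        if obj < best_obj:
            best_mu, best_obj = mu.copy(), obj
    return best_mu, best_obj

# Toy problem: two shifted Gaussian classes and three Gaussian base kernels.
rng = np.random.default_rng(0)
y = np.where(rng.random(20) < 0.5, -1.0, 1.0)
X = rng.normal(size=(20, 2)) + y[:, None]
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
base_grams = np.stack([np.exp(-g * sq) for g in (0.1, 1.0, 10.0)])
signed_z = pairwise_features(base_grams, y)
lam = 0.5
mu_hat, obj = fit_l2(signed_z, lam)
```

Because the returned objective value never exceeds its value at zero, the returned vector has L2 norm at most sqrt(2/lam).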

3 Generalization Guarantees for a Learned Kernel Combination 

Our generalization guarantee shall proceed in two steps. We shall assume that we have with us a
training set $(x_1, y_1), \ldots, (x_n, y_n)$ using which we are able to determine a combination vector $\hat\mu$ such
that $\hat{\mathcal{R}}(\hat\mu) \le \epsilon$.

1. We shall first prove that, with high probability over the choice of the training points, the
learned combination vector $\hat\mu$ will give us a kernel $K_{\hat\mu}$ that is $\left(\epsilon + \epsilon_1, \frac{1}{\sqrt{\langle \hat\mu, \kappa \rangle}}\right)$-kernel good
where $\epsilon_1 > 0$ is a quantity that can be made arbitrarily small.



^1 We note that [6] includes the terms $[1 - \langle \mu, z(x_i, x_i) \rangle]_+$ into the empirical risk as well. This does not change
the asymptotics of our analysis except for causing a bit of notational annoyance. In order to account for this term,
the true risk functional will have to include an additional term $\mathcal{R}_{\text{add}}(\mu) := \mathbb{E}_{(x,y)}\left[[1 - \langle \mu, z(x, x) \rangle]_+\right]$. This will add
a negligible term to the uniform convergence bound because we will have to consider the convergence of the term
$\hat{\mathcal{R}}_{\text{add}}(\mu) := \frac{2}{n(n+1)} \sum_{1 \le i \le n} [1 - \langle \mu, z(x_i, x_i) \rangle]_+$ to $\mathcal{R}_{\text{add}}(\mu)$. However, from thereon, the analysis will remain unaffected
since $\mathcal{R}_{\text{add}}(\mu) \ge 0$, so a combination $\mu$ having true risk $\mathcal{R}(\mu) + \mathcal{R}_{\text{add}}(\mu) \le \epsilon$ will still give a kernel that is

2. We shall then prove that given that there exists a good combination of kernels in the K-space,
with very high probability $\epsilon$ will be very small. This we will prove by showing a converse of
the inequality proved in the first step. This will allow us to give oracle inequalities for the
kernel goodness of the learned combination.



3.1 Step 1 



In this step, we prove a uniform convergence guarantee for the learning problem at hand. Using
standard proof techniques, we shall reduce the problem of uniform convergence to that of estimating
the capacity of a certain function class. The notion of capacity we shall use is the Rademacher
complexity, which we shall bound using the heavy hammer of strong convexity based bounds from
[5]. We note that the proof progression used in this step is fairly routine within the empirical
process community and has been used to give generalization proofs for other problems as well (see
for example [3, 4]).

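First of all, we note that due to the optimization process we have^2

$$\frac{\lambda}{2} \|\hat\mu\|_2^2 \le \frac{\lambda}{2} \|\hat\mu\|_2^2 + \hat{\mathcal{R}}(\hat\mu) \le \frac{\lambda}{2} \|0\|_2^2 + \hat{\mathcal{R}}(0) = 1$$

which implies that we need only concern ourselves with combination vectors inside the $L_2$ ball of
radius $r_\lambda = \sqrt{2/\lambda}$:

$$B_2(r_\lambda) := \left\{\mu \in \mathbb{R}^p : \|\mu\|_2 \le r_\lambda\right\}$$

For notational simplicity, we denote $z = (x, y)$ as a training sample. For any training set $z_1, \ldots, z_n$
where $z_i = (x_i, y_i)$ and for any $\mu \in \mathbb{R}^p$, we write $\ell(\mu, z_i, z_j) := [1 - y_i y_j \langle \mu, z(x_i, x_j) \rangle]_+$. We
assume, yet again for the sake of notational simplicity, that we obtain at all times an even number
of training samples, i.e. $n$ is even. For a ghost sample $\tilde z_1, \ldots, \tilde z_n$ then, we can write

$$\mathcal{R}(\mu) = \mathbb{E}_{\tilde z_1, \ldots, \tilde z_n}\left[\frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \ell(\mu, \tilde z_i, \tilde z_j)\right]$$

Thus we can write

$$\mathcal{R}(\hat\mu) - \hat{\mathcal{R}}(\hat\mu) \le \sup_{\mu \in B_2(r_\lambda)} \left\{\mathbb{E}_{\tilde z_1, \ldots, \tilde z_n}\left[\frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \ell(\mu, \tilde z_i, \tilde z_j)\right] - \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \ell(\mu, z_i, z_j)\right\} =: g(z_1, \ldots, z_n)$$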



^2 Any tolerance $\epsilon_{\text{opt}}$ offered by the optimizer can easily be incorporated into the bounds. However, we do not do
so for the sake of clarity.






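For any $\mu \in B_2(r_\lambda)$ and any $x_1, x_2 \in \mathcal{X}$, we have $K_\mu(x_1, x_2) \le \langle \mu, \kappa \rangle \le r_\lambda \|\kappa\|_2$. Using this, it is
not difficult to see that the expression $g(z_1, \ldots, z_n)$ can be perturbed by at most $\frac{2}{n}\left(1 + r_\lambda \|\kappa\|_2\right)$ by
the change of a single true training sample $z_i = (x_i, y_i)$ (see [6, Theorem 3.4] for the calculations).
Applying McDiarmid's inequality to this expression, we get with probability at least $1 - \delta$,

$$\mathcal{R}(\hat\mu) - \hat{\mathcal{R}}(\hat\mu) \le \mathbb{E}\left[g(z_1, \ldots, z_n)\right] + \left(1 + r_\lambda \|\kappa\|_2\right) \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$$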
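The bounded-difference constant can be traced as follows. This is a sketch under two assumptions: the pairwise average is normalized by $2/(n(n-1))$, and every loss value lies in $[0, 1 + r_\lambda\|\kappa\|_2]$. Here $c_k$ denotes the largest change in $g$ caused by replacing the single sample $z_k$.

```latex
\begin{align*}
\text{number of pairs containing } z_k &= n - 1,\\
\text{change in each such loss term} &\le 1 + r_\lambda\|\kappa\|_2,\\
c_k &\le \frac{2}{n(n-1)}\,(n-1)\,\bigl(1 + r_\lambda\|\kappa\|_2\bigr)
     = \frac{2}{n}\bigl(1 + r_\lambda\|\kappa\|_2\bigr),\\
\sqrt{\frac{1}{2}\sum_{k=1}^{n} c_k^2 \,\log\frac{1}{\delta}}
&\le \bigl(1 + r_\lambda\|\kappa\|_2\bigr)\sqrt{\frac{2\log\frac{1}{\delta}}{n}}.
\end{align*}
```

The last line is the deviation term that McDiarmid's inequality yields for these constants.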



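We now estimate the expectation term on the right hand side.

$$\mathbb{E}\left[g(z_1, \ldots, z_n)\right] = \mathbb{E}_{z}\left[\sup_{\mu \in B_2(r_\lambda)} \left\{\mathbb{E}_{\tilde z}\left[\frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \ell(\mu, \tilde z_i, \tilde z_j)\right] - \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \ell(\mu, z_i, z_j)\right\}\right]$$

$$\le \frac{2}{n(n-1)} \, \mathbb{E}_{z, \tilde z}\left[\sup_{\mu \in B_2(r_\lambda)} \left\{\sum_{1 \le i < j \le n} \ell(\mu, \tilde z_i, \tilde z_j) - \sum_{1 \le i < j \le n} \ell(\mu, z_i, z_j)\right\}\right]$$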



We now invoke a powerful alternate representation for U-statistics to simplify the above expression.
This method can be found in [4], which itself attributes this method to [8]. This, along with the
Hoeffding decomposition, is one of the two most powerful techniques for dealing with "coupled" random
variables such as the ones we have in this situation.
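The representation in question can be checked numerically on a tiny example. The sketch below assumes a symmetric pairwise function `q` (a hypothetical hinge-type choice) and verifies that a U-statistic equals the average, over all permutations of the sample, of sums over n/2 disjoint pairs. Since a supremum of averages is at most the average of suprema, this identity is what allows the coupled pairwise sum to be replaced by a sum of independent terms.

```python
import itertools
import numpy as np

def u_statistic(q, xs):
    """Average of q over all unordered pairs i < j."""
    n = len(xs)
    total = sum(q(xs[i], xs[j]) for i in range(n) for j in range(i + 1, n))
    return total * 2.0 / (n * (n - 1))

def half_sample_average(q, xs):
    """Average of q over the n/2 disjoint pairs (x_i, x_{n/2+i})."""
    h = len(xs) // 2
    return sum(q(xs[i], xs[h + i]) for i in range(h)) / h

xs = [0.3, -1.2, 2.0, 0.7]
q = lambda a, b: max(0.0, 1.0 - a * b)    # symmetric in its two arguments

# Average the half-sample statistic over every permutation of the sample.
perm_mean = np.mean([
    half_sample_average(q, [xs[k] for k in perm])
    for perm in itertools.permutations(range(len(xs)))
])
u_stat = u_statistic(q, xs)
```

Here `perm_mean` and `u_stat` agree up to floating point error.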

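Theorem 1 ([4], Lemma A.1). For any set of real valued functions $q_\tau : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ indexed by
$\tau \in T$, if $X_1, \ldots, X_n$ are i.i.d. random variables then we have

$$\mathbb{E}\left[\sup_{\tau \in T} \frac{1}{n(n-1)} \sum_{i \ne j} q_\tau(X_i, X_j)\right] \le \mathbb{E}\left[\sup_{\tau \in T} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} q_\tau\left(X_i, X_{\lfloor n/2 \rfloor + i}\right)\right]$$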
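
Applying this decoupling result to the random variables $X_i = (z_i, \tilde z_i)$, the index set $B_2(r_\lambda)$ and
functions $q_\mu(X_i, X_j) = \ell(\mu, \tilde z_i, \tilde z_j) - \ell(\mu, z_i, z_j)$ we get

$$\begin{aligned}
\mathbb{E}\left[g(z_1, \ldots, z_n)\right] &\le \frac{2}{n} \, \mathbb{E}\left[\sup_{\mu \in B_2(r_\lambda)} \sum_{i=1}^{n/2} \left(\ell(\mu, \tilde z_i, \tilde z_{n/2+i}) - \ell(\mu, z_i, z_{n/2+i})\right)\right] \\
&= \frac{2}{n} \, \mathbb{E}\left[\sup_{\mu \in B_2(r_\lambda)} \sum_{i=1}^{n/2} \epsilon_i \left(\ell(\mu, \tilde z_i, \tilde z_{n/2+i}) - \ell(\mu, z_i, z_{n/2+i})\right)\right] \\
&\le \frac{4}{n} \, \mathbb{E}\left[\sup_{\mu \in B_2(r_\lambda)} \sum_{i=1}^{n/2} \epsilon_i \, \ell(\mu, z_i, z_{n/2+i})\right] \\
&= \frac{4}{n} \, \mathbb{E}\left[\sup_{\mu \in B_2(r_\lambda)} \sum_{i=1}^{n/2} \epsilon_i \left[1 - y_i y_{n/2+i} \langle \mu, z(x_i, x_{n/2+i}) \rangle\right]_+\right] \\
&\le \frac{4}{n} \, \mathbb{E}\left[\sup_{\mu \in B_2(r_\lambda)} \sum_{i=1}^{n/2} \epsilon_i \langle \mu, z(x_i, x_{n/2+i}) \rangle\right]
\end{aligned}$$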

where in the second step, we performed symmetrization on the decoupled expression by introducing
Rademacher random variables $\epsilon_i$, $i = 1, \ldots, n/2$. In the fifth step we have applied the contraction
inequality stated in Theorem 2 below on the 1-Lipschitz functions $\phi_i : x \mapsto [1 - a_i x]_+$ where
$a_i = y_i y_{n/2+i}$. We have exploited the fact that Theorem 2 actually proves the contraction inequality
for the empirical Rademacher averages, which allows us to treat $a_i$ as constants dependent only on $i$.


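Theorem 2. Let $\mathcal{H}$ be a set of bounded real valued functions from some domain $\mathcal{X}$ and let $x_1, \ldots, x_n$
be arbitrary elements from $\mathcal{X}$. Furthermore, let $\phi_i : \mathbb{R} \to \mathbb{R}$, $i = 1, \ldots, n$ be $L$-Lipschitz functions.
Then we have

$$\mathbb{E}_\epsilon\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \epsilon_i \phi_i(h(x_i))\right] \le L \, \mathbb{E}_\epsilon\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \epsilon_i h(x_i)\right]$$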
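
Proof. Ledoux and Talagrand (see [7, Theorem 4.12]) prove the same result but for wrapper functions
that satisfy $\phi_i(0) = 0$ for all $i$. To get the result, simply apply the result to the functions
$\varphi_i : x \mapsto \phi_i(x) - \phi_i(0)$ to get

$$\mathbb{E}_\epsilon\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \epsilon_i \phi_i(h(x_i))\right] = \mathbb{E}_\epsilon\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \epsilon_i \varphi_i(h(x_i))\right] + \mathbb{E}_\epsilon\left[\frac{1}{n} \sum_{i=1}^n \epsilon_i \phi_i(0)\right] \le L \, \mathbb{E}_\epsilon\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \epsilon_i h(x_i)\right]$$

where we apply [7, Theorem 4.12] to the first term and the second term vanishes by linearity of
expectation. □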
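The contraction step can be sanity-checked on a small finite class. The sketch below is an illustration under assumptions: a hypothetical class of six functions given by their values at eight points, and exact enumeration of all sign vectors instead of sampling. It uses the hinge-type wrappers that appear in Step 1, for which phi_i(0) = 1 rather than 0.

```python
import itertools
import numpy as np

def rademacher_sup(values):
    """Exact E_eps[ sup_h sum_i eps_i * values[h, i] ] over all sign vectors."""
    n = values.shape[1]
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))  # (2^n, n)
    return (signs @ values.T).max(axis=1).mean()

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 8))             # row h holds h(x_1), ..., h(x_8)
a = rng.choice([-1.0, 1.0], size=8)     # fixed signs, playing the role of y_i * y_{n/2+i}
phi_H = np.maximum(0.0, 1.0 - a * H)    # phi_i(t) = [1 - a_i t]_+, each 1-Lipschitz

lhs = rademacher_sup(phi_H)             # Rademacher average after applying phi_i
rhs = rademacher_sup(H)                 # Rademacher average of the raw class
```

With Lipschitz constant L = 1, the value `lhs` does not exceed `rhs`, as the contraction inequality predicts.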

The concluding term in the last chain of inequalities gives us the Rademacher complexity of the
hypothesis class $B_2(r_\lambda)$. At this point we introduce the following result on Rademacher complexities
of regularized linear predictor classes.



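Theorem 3 ([5], Theorem 1). Let $\mathcal{W}$ be a closed convex set and let $F : \mathcal{W} \to \mathbb{R}$ be $\sigma$-strongly
convex w.r.t. $\|\cdot\|_*$. Assume $\mathcal{W} \subseteq \{w : F(w) \le W_*^2\}$. Furthermore, let $\mathcal{X} = \{x : \|x\| \le X\}$ and
$\mathcal{F}_{\mathcal{W}} := \{x \mapsto \langle w, x \rangle : w \in \mathcal{W}, x \in \mathcal{X}\}$. Then, we have

$$\mathcal{R}_n(\mathcal{F}_{\mathcal{W}}) \le X W_* \sqrt{\frac{2}{\sigma n}}$$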




Although [5] make their claim for the normal Rademacher average, their proof actually gives
bounds for the empirical Rademacher averages. Since our hypothesis class is $L_2$ regularized, we
can apply Theorem 3 to the $L_2/L_2$ case with $F(\mu) = \|\mu\|_2^2$ as the regularizer. Since we have
$\sup_{x_1, x_2 \in \mathcal{X}} \|z(x_1, x_2)\|_2 \le \|\kappa\|_2$, we get

$$\mathbb{E}\left[g(z_1, \ldots, z_n)\right] \le 2 r_\lambda \|\kappa\|_2 \sqrt{\frac{2}{n}}$$

We have thus proved the following result

Theorem 4. With probability at least $1 - \delta$ over the choice of training samples, the minimizer $\hat\mu$
of the expression

$$\min_{\mu \ge 0} \; \frac{\lambda}{2} \|\mu\|_2^2 + \hat{\mathcal{R}}(\mu)$$

satisfies the following

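Since $\hat\mu \in B_2(r_\lambda)$, we have $\langle \hat\mu, \kappa \rangle \le r_\lambda \|\kappa\|_2$. This implies that the kernel $K_{\hat\mu}$ is at least
$\left(\epsilon + \epsilon_1, \frac{1}{\sqrt{\|\kappa\|_2}} \sqrt[4]{\frac{\lambda}{2}}\right)$-kernel good where $\epsilon = \hat{\mathcal{R}}(\hat\mu)$ and $\epsilon_1 \le 6 \|\kappa\|_2 \sqrt{\frac{\log \frac{1}{\delta}}{\lambda n}}$. In particular, if all the $p$
kernels share a common bound, i.e. $\kappa_i \le \kappa$ for all $i$, then $\|\kappa\|_2 \le \kappa^2 \sqrt{p}$ and we can show the kernel
$K_{\hat\mu}$ to be $\left(\epsilon + 6 \kappa^2 \sqrt{\frac{p \log \frac{1}{\delta}}{\lambda n}}, \frac{1}{\kappa} \sqrt[4]{\frac{\lambda}{2p}}\right)$-kernel good.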
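The Rademacher complexity of an L2 ball of linear functions, used in Step 1, can be computed exactly for a small sample. The sketch below is illustrative only: the K-space vectors `Z` are random stand-ins, and the comparison uses the elementary bound r * max_i ||z_i||_2 / sqrt(m) rather than the exact statement of Theorem 3.

```python
import itertools
import numpy as np

def l2_ball_rademacher(Z, r):
    """Exact empirical Rademacher average of {z -> <mu, z> : ||mu||_2 <= r} on rows of Z."""
    m = Z.shape[0]
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=m)))  # (2^m, m)
    # By Cauchy-Schwarz, the supremum of <mu, v> over the ball equals r * ||v||_2.
    return r * np.linalg.norm(signs @ Z, axis=1).mean() / m

rng = np.random.default_rng(2)
Z = rng.uniform(-1.0, 1.0, size=(10, 4))   # hypothetical K-space vectors z(x, x')
r = 2.0

exact = l2_ball_rademacher(Z, r)
bound = r * np.linalg.norm(Z, axis=1).max() / np.sqrt(Z.shape[0])
```

The value `exact` stays below `bound`, which decays at the 1/sqrt(m) rate that drives the generalization guarantee.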

3.2 Step 2 

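Just as we analyzed the excess risk expression $\mathcal{R}(\mu) - \hat{\mathcal{R}}(\mu)$ uniformly over vectors in the ball $B_2(r_\lambda)$,
we can similarly analyze the expression $\hat{\mathcal{R}}(\mu) - \mathcal{R}(\mu)$ uniformly over any (fixed) ball $B_2(r)$ to get
the following result.

Theorem 5. Let $r > 0$ be some fixed radius, then with probability at least $1 - \delta$ over the choice of
training samples, all combination vectors $\mu \in B_2(r)$ satisfy

$$\hat{\mathcal{R}}(\mu) \le \mathcal{R}(\mu) + 2 r \|\kappa\|_2 \sqrt{\frac{2}{n}} + \left(1 + r \|\kappa\|_2\right) \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$$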

This allows us to give the following oracle inequality:

Theorem 6. Suppose as an oracle assumption we assume that there exists a good combination
vector $\mu_0$ that is $\epsilon_0$-combination good, then we can output with probability at least $1 - \delta$, for any
$\epsilon_1 > 0$ using $n = \Omega\left(\frac{\|\mu_0\|_2^2 \|\kappa\|_2^2 \log \frac{1}{\delta}}{\epsilon_1^3}\right)$ training samples, a combination vector $\hat\mu$ such that the corresponding
kernel $K_{\hat\mu}$ is $\left(\epsilon_0 + \epsilon_1, \frac{1}{\sqrt{\|\mu_0\|_2 \|\kappa\|_2}} \sqrt[4]{\frac{\epsilon_1}{3}}\right)$-kernel good.

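Proof. Using Theorem 5 we have with probability at least $1 - \delta$,

$$\hat{\mathcal{R}}(\mu_0) \le \epsilon_0 + 2 \|\mu_0\|_2 \|\kappa\|_2 \sqrt{\frac{2}{n}} + \left(1 + \|\mu_0\|_2 \|\kappa\|_2\right) \sqrt{\frac{2 \log \frac{1}{\delta}}{n}} \le \epsilon_0 + 6 \|\mu_0\|_2 \|\kappa\|_2 \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$$

Since $\hat\mu$ is the minimizer of the regularized empirical risk, we have

$$\frac{\lambda}{2} \|\hat\mu\|_2^2 + \hat{\mathcal{R}}(\hat\mu) \le \frac{\lambda}{2} \|\mu_0\|_2^2 + \hat{\mathcal{R}}(\mu_0) \le \frac{\lambda}{2} \|\mu_0\|_2^2 + \epsilon_0 + 6 \|\mu_0\|_2 \|\kappa\|_2 \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$$

which gives us, since $\|\hat\mu\|_2 \ge 0$,

$$\hat{\mathcal{R}}(\hat\mu) \le \epsilon_0 + \frac{\lambda}{2} \|\mu_0\|_2^2 + 6 \|\mu_0\|_2 \|\kappa\|_2 \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$$

Applying Theorem 4 we get with probability at least $1 - 2\delta$,

$$\mathcal{R}(\hat\mu) \le \epsilon_0 + \frac{\lambda}{2} \|\mu_0\|_2^2 + 6 \|\mu_0\|_2 \|\kappa\|_2 \sqrt{\frac{2 \log \frac{1}{\delta}}{n}} + 6 \|\kappa\|_2 \sqrt{\frac{\log \frac{1}{\delta}}{\lambda n}}$$

For any $0 < \epsilon_1 < 3/4$, setting $\lambda = \frac{2 \epsilon_1}{3 \|\mu_0\|_2^2}$ and requiring $n \ge \frac{486}{\epsilon_1^3} \|\mu_0\|_2^2 \|\kappa\|_2^2 \log \frac{1}{\delta}$ so that all three
terms after $\epsilon_0$ in the above expression are less than $\epsilon_1/3$ gives us the result (for values of $\epsilon_1$ larger than 3/4,
$n \ge \frac{648}{\epsilon_1^2} \|\mu_0\|_2^2 \|\kappa\|_2^2 \log \frac{1}{\delta}$ suffices). □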

Such oracle inequalities are very desirable since they tell us that we would be able to give a
performance that is competitive against any fixed kernel in foresight. If we set $\lambda$ to an
oracle-oblivious value, i.e. one that does not depend on $\mu_0$, then although we get an inferior claim
with respect to the kernel-goodness, we are able to make that claim in hindsight as well.

4 Learning Sparse Kernel Combinations 

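Since the complexity of evaluating the kernel $K_\mu$ goes up roughly as $\|\mu\|_0$, it is desirable to
learn sparse combinations. This can be done by changing the learning formulation slightly to the
following:

$$\min_{\mu \ge 0} \; \frac{\lambda}{2} \|\mu\|_1 + \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \left[1 - y_i y_j \langle \mu, z(x_i, x_j) \rangle\right]_+$$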
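One possible way to solve an L1-regularized problem of this kind is sketched below. It assumes the penalty is (lam/2) * ||mu||_1, which is linear on the non-negative orthant, and uses projected subgradient descent with a 1/sqrt(t) step size. The solver, the random stand-in data and the name `fit_l1` are illustrative assumptions, not the method of [6].

```python
import numpy as np

def fit_l1(signed_z, lam, steps=500):
    """Projected subgradient descent on (lam/2)*sum(mu) + mean hinge loss, mu >= 0."""
    n_pairs, p = signed_z.shape
    mu = np.zeros(p)
    best_mu, best_obj = mu.copy(), 1.0          # the objective at mu = 0 equals 1
    for t in range(1, steps + 1):
        active = (signed_z @ mu) < 1.0          # pairs with non-zero hinge loss
        grad = 0.5 * lam - signed_z[active].sum(axis=0) / n_pairs
        mu = np.maximum(0.0, mu - grad / np.sqrt(t))   # step, then project onto mu >= 0
        obj = 0.5 * lam * mu.sum() + np.maximum(0.0, 1.0 - signed_z @ mu).mean()
        if obj < best_obj:
            best_mu, best_obj = mu.copy(), obj
    return best_mu, best_obj

rng = np.random.default_rng(3)
signed_z = rng.uniform(-0.2, 1.0, size=(200, 6))   # stand-ins for y_i * y_j * z(x_i, x_j)
lam = 0.1
mu_hat, obj = fit_l1(signed_z, lam)
```

Since the returned objective is at most its value at zero, the returned vector has L1 norm at most 2/lam under this form of the penalty.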

The above learning algorithm can also be shown to admit generalization guarantees. For the sake of brevity
we only give below the main points where the analysis differs from the $L_2$ regularized case. First of
all, we would be able to show that the regularized empirical risk minimizer $\hat\mu$ would lie in the $L_1$
ball $B_1(s_\lambda) := \{\mu \in \mathbb{R}^p : \|\mu\|_1 \le s_\lambda\}$ where $s_\lambda = \frac{2}{\lambda}$.

Due to this, the perturbations to the expression $g(z_1, \ldots, z_n)$ would be limited by $\frac{2}{n}\left(1 + s_\lambda \|\kappa\|_\infty\right)$.
While applying Theorem 3 we would instead consider the regularizer $F(\mu) = \|\mu\|_q^2$ for $q = \frac{\log p}{\log p - 1}$,
which is strongly convex with respect to the norm $\|\cdot\|_1$ with a modulus of order $\frac{1}{\log p}$. This would allow us to bound the
Rademacher complexity of the hypothesis class $B_1(s_\lambda)$ as

$$\mathcal{R}_{n/2}\left(B_1(s_\lambda)\right) \le s_\lambda \|\kappa\|_\infty \sqrt{\frac{2 \log p}{n}} = \frac{2}{\lambda} \|\kappa\|_\infty \sqrt{\frac{2 \log p}{n}}$$

This allows us to make the following claim:

Theorem 7. With probability at least $1 - \delta$ over the choice of training samples, the minimizer $\hat\mu$
of the expression

$$\min_{\mu \ge 0} \; \frac{\lambda}{2} \|\mu\|_1 + \hat{\mathcal{R}}(\mu)$$

satisfies the following
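
Since $\hat\mu \in B_1(s_\lambda)$, we have $\langle \hat\mu, \kappa \rangle \le s_\lambda \|\kappa\|_\infty$. This implies that the kernel $K_{\hat\mu}$ is $\left(\epsilon + \epsilon_1, \sqrt{\frac{\lambda}{2 \|\kappa\|_\infty}}\right)$-kernel
good where $\epsilon = \hat{\mathcal{R}}(\hat\mu)$ and $\epsilon_1 \le \frac{6 \|\kappa\|_\infty}{\lambda \sqrt{n}} \left(\sqrt{\log p} + \sqrt{\log 1/\delta}\right)$. In particular, if all the $p$ kernels
share a common bound, i.e. $\kappa_i \le \kappa$ for all $i$, then $\|\kappa\|_\infty \le \kappa^2$ and we can show the kernel $K_{\hat\mu}$ to be
$\left(\epsilon + \frac{6 \kappa^2}{\lambda \sqrt{n}} \left(\sqrt{\log p} + \sqrt{\log 1/\delta}\right), \frac{1}{\kappa} \sqrt{\frac{\lambda}{2}}\right)$-kernel good.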

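Note that this result has a much better dependence on $p$ than the result for $L_2$ regularized
learning, where we were only able to show that the kernel $K_{\hat\mu}$ was $\left(\epsilon + 6 \kappa^2 \sqrt{\frac{p \log \frac{1}{\delta}}{\lambda n}}, \frac{1}{\kappa} \sqrt[4]{\frac{\lambda}{2p}}\right)$-kernel
good.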
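The difference in the dependence on p can be made concrete with a few numbers. The sketch below ignores all constants and the dependence on lambda, n and delta; it only compares the factor sqrt(p) that enters through ||kappa||_2 in the L2 analysis with the factor sqrt(log p) that multiplies ||kappa||_inf in the L1 analysis. The function names and the common bound `kappa_sq` are illustrative.

```python
import math

def l2_norm_bound(kappa_sq, p):
    """Upper bound on ||kappa||_2 when every entry of kappa is at most kappa_sq."""
    return kappa_sq * math.sqrt(p)

def l1_complexity_factor(kappa_sq, p):
    """Factor ||kappa||_inf * sqrt(log p) appearing in the L1-regularized analysis."""
    return kappa_sq * math.sqrt(math.log(p))

for p in (10, 100, 1000, 10000):
    print(p, round(l2_norm_bound(1.0, p), 2), round(l1_complexity_factor(1.0, p), 2))
```

The first factor grows polynomially in the number of base kernels while the second grows only like the square root of its logarithm.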

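We can also show the following version of Theorem 5 to be true

Theorem 8. Let $s > 0$ be some fixed radius, then with probability at least $1 - \delta$ over the choice of
training samples, all combination vectors $\mu \in B_1(s)$ satisfy

$$\hat{\mathcal{R}}(\mu) \le \mathcal{R}(\mu) + 2 s \|\kappa\|_\infty \sqrt{\frac{2 \log p}{n}} + \left(1 + s \|\kappa\|_\infty\right) \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$$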



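Using this, and going as before, we are also able to guarantee the following oracle inequality
similar to Theorem 6

Theorem 9. Suppose as an oracle assumption we assume that there exists a good combination
vector $\mu_0$ that is $\epsilon_0$-combination good, then we can output with probability at least $1 - \delta$, for any
$\epsilon_1 > 0$ using $n = \Omega\left(\frac{\|\mu_0\|_1^2 \|\kappa\|_\infty^2 \left(\log p + \log \frac{1}{\delta}\right)}{\epsilon_1^4}\right)$ training samples, a combination vector $\hat\mu$ such that the corresponding
kernel $K_{\hat\mu}$ is $\left(\epsilon_0 + \epsilon_1, \sqrt{\frac{\epsilon_1}{3 \|\kappa\|_\infty \|\mu_0\|_1}}\right)$-kernel good.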

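Proof. Following the chain of inequalities given by Theorems 7 and 8 and using optimality of the
regularized empirical risk minimizer $\hat\mu$, we get, with probability at least $1 - 2\delta$,

$$\mathcal{R}(\hat\mu) \le \epsilon_0 + \frac{\lambda}{2} \|\mu_0\|_1 + \frac{6 \|\kappa\|_\infty \left(\sqrt{\log p} + \sqrt{\log 1/\delta}\right)}{\sqrt{n}} \left(\|\mu_0\|_1 + \frac{1}{\lambda}\right)$$

Setting $\lambda = \frac{2 \epsilon_1}{3 \|\mu_0\|_1}$ and requiring $n \ge \left(\frac{135 \|\mu_0\|_1 \|\kappa\|_\infty \left(\sqrt{\log p} + \sqrt{\log 1/\delta}\right)}{\epsilon_1^2}\right)^2$ finishes the proof. □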

5 Discussion on the Nature of Guarantees 

The guarantees given above, both for the sparse as well as the non-sparse kernel learning cases, are
slightly unsatisfactory in the sense that they assume combination goodness to ensure kernel goodness. In
other words, they assume the existence of a combination that is $\epsilon$-combination good before guaranteeing
that the output would be a kernel that is $(\epsilon', \gamma')$-kernel good. Ideally, we should have used the
promise of existence of a kernel that is $(\epsilon, \gamma)$-kernel good to ensure that a good kernel is output.

One way to prove such a result would be to show that if there exists a kernel combination that
is $(\epsilon, \gamma)$-kernel good, then there also exists some combination $\mu \in \mathbb{R}^p$ that is $\epsilon'$-combination good
for some $\epsilon' > 0$. However, this is an unlikely result and the aim of this section is to discuss this
point. It turns out that the biggest hurdle that one faces in proving such a result is the form of
combination goodness chosen by [6]. The definition of combination goodness used in [6] is related
to the notion of similarity goodness proposed in [1] except for the absence of a weight function.

More specifically, [1] consider a kernel $K$ to be $\epsilon$-similarity good if for some weight function
$w : \mathcal{X} \to \mathbb{R}$ the following holds:

$$\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\left[1 - y \, \mathbb{E}_{(x',y') \sim \mathcal{D}}\left[y' w(x') K(x, x')\right]\right]_+\right] \le \epsilon$$

For ease of comparison, we have absorbed the margin parameter $\gamma$ in the definition given in [1] into
the weight function $w(\cdot)$. Note that if the notion of combination goodness had been defined using

$$\mathbb{E}_{(x,y),(x',y') \sim \mathcal{D}}\left[\left[1 - yy' w(x') K_\mu(x, x')\right]_+\right]$$

instead, then one could have used some form of inverse Jensen's inequality to convert similarity
goodness into combination goodness. Since the presence of the weight function makes it possible
to perform crisp conversions of kernel goodness into similarity goodness, as was done in [9], this could have
been one way to convert kernel goodness into combination goodness (i.e. via similarity goodness).
However, due to the absence of such weight functions, it seems difficult to convert kernel goodness
into combination goodness using the methods of [9].

Another reason to believe in the non-existence of such conversions from kernel to combination
goodness is the form of the predictor in the RKHS. If one looks at the proof of Lemma
3.2 in [6] then one notices that the kernel goodness is proven with respect to the predictor
$w = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[y \, \Phi_{K_\mu}(x)\right]$ where $\Phi_{K_\mu} : \mathcal{X} \to \mathcal{H}_{K_\mu}$ is the feature map corresponding to the kernel $K_\mu$.

This turns out to be a very restrictive form for the predictor. A kernel can be good due to the
existence of any unit norm predictor in its RKHS. However, the notion of combination goodness
seems to prefer predictors that point from the mean of the images of the negative points to the
mean of the images of the positive points in the RKHS. It was noted in [1] that such a notion of
goodness is too strong ([1] actually call this the strongly-good notion of similarity goodness) and
that there exist kernels that are very good with respect to the learning task at hand but for which the uniform
vectors $w = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[y \, \Phi_K(x)\right]$ in their RKHSes perform poorly (see [1, Definition 2] and the
discussion thereafter).
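The restricted predictor discussed above can be evaluated from a Gram matrix alone. The sketch below uses the empirical analogue w = (1/n) * sum_i y_i * Phi(x_i) on a hypothetical three-point Gram matrix; the function name and the data are illustrative.

```python
import numpy as np

def mean_predictor_scores(K, y):
    """Scores <w, Phi(x_k)> for w = (1/n) * sum_i y_i * Phi(x_i), via the Gram matrix K."""
    return K @ y / len(y)

# Hypothetical Gram matrix: the first two points are similar, the third is isolated.
K = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
scores = mean_predictor_scores(K, y)   # approximately [0.5, 0.5, -0.333]
```

Each score is a label-weighted average of similarities to the sample, so the predictor has no free parameters beyond the kernel itself, which is what makes this form restrictive.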

Thus it seems unlikely that the current proof technique can be extended to accept promises of 
kernel goodness. The technique seems inherently suited to accept combination goodness and output 
good kernels. It would be interesting to see whether the existing proofs can be modified or whether 
the algorithms can be modified to accommodate kernel goodness. 



References 

[1] Maria-Florina Balcan and Avrim Blum. On a Theory of Learning with Similarity Functions. In
International Conference on Machine Learning, 2006.

[2] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian Complexities: Risk Bounds
and Structural Results. Journal of Machine Learning Research, 3:463-482, 2002.

[3] Qiong Cao, Zheng-Chu Guo, and Yiming Ying. Generalization Bounds for Metric and Similarity
Learning, 2012. arXiv:1207.5437.

[4] Stéphan Clémençon, Gábor Lugosi, and Nicolas Vayatis. Ranking and empirical minimization
of U-statistics. Annals of Statistics, 36:844-874, 2008.

[5] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the Complexity of Linear Prediction:
Risk Bounds, Margin Bounds, and Regularization. In Annual Conference on Neural
Information Processing Systems, 2008.

[6] Abhishek Kumar, Alexandru Niculescu-Mizil, Koray Kavukcuoglu, and Hal Daumé III. A Binary
Classification Framework for Two-Stage Multiple Kernel Learning. In International Conference
on Machine Learning, 2012.

[7] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes.
Springer, 2002.

[8] Robert J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.

[9] Nathan Srebro. How Good Is a Kernel When Used as a Similarity Measure? In 20th Annual
Conference on Computational Learning Theory, pages 323-335, 2007.






